{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Building a Regression Model for a Financial Dataset\n",
    "\n",
    "In this notebook, you will build a simple linear regression model to predict the closing AAPL stock price. The lab objectives are:\n",
    "* Pull data from BigQuery into a Pandas dataframe\n",
    "* Use Matplotlib to visualize data\n",
    "* Use Scikit-Learn to build a regression model"
   ]
  },
  {
    "cell_type": "code",
    "metadata": {
      "id": "Nny3m465gKkY",
      "colab_type": "code",
      "colab": {}
    },
    "source": [
      "!sudo chown -R jupyter:jupyter /home/jupyter/training-data-analyst"
    ],
    "execution_count": null,
    "outputs": []
  },
  {
    "cell_type": "code",
    "metadata": {
      "id": "Nny3m465gKkY",
      "colab_type": "code",
      "colab": {}
    },
    "source": [
      "!pip install --user google-cloud-bigquery==1.25.0"
    ],
    "execution_count": null,
    "outputs": [
      {
       "name": "stdout",
       "output_type": "stream",
       "text": [
         "Collecting google-cloud-bigquery==1.25.0\n",
         "Downloading https://files.pythonhosted.org/packages/48/6d/e8f5e5cd05ee968682d389cec3fdbccb920f1f8302464a46ef87b7b8fdad/google_cloud_bigquery-1.25.0-py2.py3-none-any.whl (169kB)\n",
         "|████████████████████████████████| 174kB 3.2MB/s eta 0:00:01\n",
         "Requirement already satisfied: google-cloud-core<2.0dev,>=1.1.0 in /usr/local/lib/python3.5/dist-packages (from google-cloud-bigquery==1.25.0) (1.2.0)\n",
         "Requirement already satisfied: google-resumable-media<0.6dev,>=0.5.0 in /usr/local/lib/python3.5/dist-packages (from google-cloud-bigquery==1.25.0) (0.5.0)\n",
         "Requirement already satisfied: google-auth<2.0dev,>=1.9.0 in /usr/local/lib/python3.5/dist-packages (from google-cloud-bigquery==1.25.0) (1.10.1)\n",
         "Requirement already satisfied: protobuf>=3.6.0 in /usr/local/lib/python3.5/dist-packages (from google-cloud-bigquery==1.25.0) (3.11.2)\n",
         "Requirement already satisfied: google-api-core<2.0dev,>=1.15.0 in /usr/local/lib/python3.5/dist-packages (from google-cloud-bigquery==1.25.0) (1.16.0)\n",
         "Requirement already satisfied: six<2.0.0dev,>=1.13.0 in /usr/local/lib/python3.5/dist-packages (from google-cloud-bigquery==1.25.0) (1.14.0)\n",
         "Requirement already satisfied: setuptools>=40.3.0 in /usr/local/lib/python3.5/dist-packages (from google-auth<2.0dev,>=1.9.0->google-cloud-bigquery==1.25.0) (45.0.0)\n",
         "Requirement already satisfied: cachetools<5.0,>=2.0.0 in /usr/local/lib/python3.5/dist-packages (from google-auth<2.0dev,>=1.9.0->google-cloud-bigquery==1.25.0) (4.0.0)\n",
         "Requirement already satisfied: rsa<4.1,>=3.1.4 in /usr/local/lib/python3.5/dist-packages (from google-auth<2.0dev,>=1.9.0->google-cloud-bigquery==1.25.0) (4.0)\n",
         "Requirement already satisfied: pyasn1-modules>=0.2.1 in /usr/local/lib/python3.5/dist-packages (from google-auth<2.0dev,>=1.9.0->google-cloud-bigquery==1.25.0) (0.2.8)\n",
         "Requirement already satisfied: pytz in /usr/local/lib/python3.5/dist-packages (from google-api-core<2.0dev,>=1.15.0->google-cloud-bigquery==1.25.0) (2019.3)\n",
         "Requirement already satisfied: requests<3.0.0dev,>=2.18.0 in /usr/local/lib/python3.5/dist-packages (from google-api-core<2.0dev,>=1.15.0->google-cloud-bigquery==1.25.0) (2.22.0)\n",
         "Requirement already satisfied: googleapis-common-protos<2.0dev,>=1.6.0 in /usr/local/lib/python3.5/dist-packages (from google-api-core<2.0dev,>=1.15.0->google-cloud-bigquery==1.25.0) (1.51.0)\n",
         "Requirement already satisfied: pyasn1>=0.1.3 in /usr/local/lib/python3.5/dist-packages (from rsa<4.1,>=3.1.4->google-auth<2.0dev,>=1.9.0->google-cloud-bigquery==1.25.0) (0.4.8)\n",
         "Requirement already satisfied: urllib3!=1.25.0,!=1.25.1,<1.26,>=1.21.1 in /usr/local/lib/python3.5/dist-packages (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0dev,>=1.15.0->google-cloud-bigquery==1.25.0) (1.24.2)\n",
         "Requirement already satisfied: idna<2.9,>=2.5 in /usr/local/lib/python3.5/dist-packages (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0dev,>=1.15.0->google-cloud-bigquery==1.25.0) (2.8)\n",
         "Requirement already satisfied: chardet<3.1.0,>=3.0.2 in /usr/local/lib/python3.5/dist-packages (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0dev,>=1.15.0->google-cloud-bigquery==1.25.0) (3.0.4)\n",
         "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.5/dist-packages (from requests<3.0.0dev,>=2.18.0->google-api-core<2.0dev,>=1.15.0->google-cloud-bigquery==1.25.0) (2019.11.28)\n",
         "Installing collected packages: google-cloud-bigquery\n",
         "Successfully installed google-cloud-bigquery-1.25.0 google-resumable-media-0.5.1\n",
         "WARNING: You are using pip version 19.3.1; however, version 20.2.3 is available.\n",
         "You should consider upgrading via the 'pip install --upgrade pip' command.\n"
       ]
      }
    ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note**: Restart your kernel to use updated packages."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Kindly ignore the deprecation warnings and incompatibility errors related to google-cloud-storage."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bash\n",
    "\n",
    "bq mk -d ai4f\n",
    "bq load --autodetect --source_format=CSV ai4f.AAPL10Y gs://cloud-training/ai4f/AAPL10Y.csv"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline\n",
    "\n",
    "import pandas as pd\n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn import linear_model\n",
    "from sklearn.metrics import mean_squared_error\n",
    "from sklearn.metrics import r2_score\n",
    "\n",
    "plt.rc('figure', figsize=(12, 8.0))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Pull Data from BigQuery\n",
    "\n",
    "In this section we'll use a magic function to query a BigQuery table and then store the output in a Pandas dataframe. A magic function is just an alias to perform a system command. To see documentation on the \"bigquery\" magic function execute the following cell:"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The query below selects everything you'll need to build a regression model to predict the closing price of AAPL stock. The model will be very simple for the purposes of demonstrating BQML functionality. The only features you'll use as input into the model are the previous day's closing price and a three day trend value. The trend value can only take on two values, either -1 or +1. If the AAPL stock price has increased over any two of the previous three days then the trend will be +1. Otherwise, the trend value will be -1.\n",
    "\n",
    "Note, the features you'll need can be generated from the raw table `ai4f.AAPL10Y` using Pandas functions. However, it's better to take advantage of the serverless-ness of BigQuery to do the data pre-processing rather than applying the necessary transformations locally.  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%%bigquery df\n",
    "WITH\n",
    "  raw AS (\n",
    "  SELECT\n",
    "    date,\n",
    "    close,\n",
    "    LAG(close, 1) OVER(ORDER BY date) AS min_1_close,\n",
    "    LAG(close, 2) OVER(ORDER BY date) AS min_2_close,\n",
    "    LAG(close, 3) OVER(ORDER BY date) AS min_3_close,\n",
    "    LAG(close, 4) OVER(ORDER BY date) AS min_4_close\n",
    "  FROM\n",
    "    `ai4f.AAPL10Y`\n",
    "  ORDER BY\n",
    "    date DESC ),\n",
    "  raw_plus_trend AS (\n",
    "  SELECT\n",
    "    date,\n",
    "    close,\n",
    "    min_1_close,\n",
    "    IF (min_1_close - min_2_close > 0, 1, -1) AS min_1_trend,\n",
    "    IF (min_2_close - min_3_close > 0, 1, -1) AS min_2_trend,\n",
    "    IF (min_3_close - min_4_close > 0, 1, -1) AS min_3_trend\n",
    "  FROM\n",
    "    raw ),\n",
    "  train_data AS (\n",
    "  SELECT\n",
    "    date,\n",
    "    close,\n",
    "    min_1_close AS day_prev_close,\n",
    "    IF (min_1_trend + min_2_trend + min_3_trend > 0, 1, -1) AS trend_3_day\n",
    "  FROM\n",
    "    raw_plus_trend\n",
    "  ORDER BY\n",
    "    date ASC )\n",
    "SELECT\n",
    "  *\n",
    "FROM\n",
    "  train_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "View the first five rows of the query's output. Note that the object `df` containing the query output is a Pandas Dataframe."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print(type(df))\n",
    "df.dropna(inplace=True)\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Visualize data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The simplest plot you can make is to show the closing stock price as a time series. Pandas DataFrames have built in plotting funtionality based on Matplotlib. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.plot(x='date', y='close');"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "You can also embed the `trend_3_day` variable into the time series above. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "start_date = '2018-06-01'\n",
    "end_date = '2018-07-31'\n",
    "\n",
    "plt.plot(\n",
    "    'date', 'close', 'k--',\n",
    "    data = (\n",
    "        df.loc[pd.to_datetime(df.date).between(start_date, end_date)]\n",
    "    )\n",
    ")\n",
    "\n",
    "plt.scatter(\n",
    "    'date', 'close', color='b', label='pos trend', \n",
    "    data = (\n",
    "        df.loc[df.trend_3_day == 1 & pd.to_datetime(df.date).between(start_date, end_date)]\n",
    "    )\n",
    ")\n",
    "\n",
    "plt.scatter(\n",
    "    'date', 'close', color='r', label='neg trend',\n",
    "    data = (\n",
    "        df.loc[(df.trend_3_day == -1) & pd.to_datetime(df.date).between(start_date, end_date)]\n",
    "    )\n",
    ")\n",
    "\n",
    "plt.legend()\n",
    "plt.xticks(rotation = 90);"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "df.shape"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Build a Regression Model in Scikit-Learn"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In this section you'll train a linear regression model to predict AAPL closing prices when given the previous day's closing price `day_prev_close` and the three day trend `trend_3_day`. A training set and test set are created by sequentially splitting the data after 2000 rows. "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "features = ['day_prev_close', 'trend_3_day']\n",
    "target = 'close'\n",
    "\n",
    "X_train, X_test = df.loc[:2000, features], df.loc[2000:, features]\n",
    "y_train, y_test = df.loc[:2000, target], df.loc[2000:, target]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Create linear regression object. Don't include an intercept,\n",
    "# TODO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Train the model using the training set\n",
    "# TODO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Make predictions using the testing set\n",
    "# TODO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the root mean squared error of your predictions\n",
    "# TODO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Print the variance score (1 is perfect prediction)\n",
    "# TODO"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# Plot the predicted values against their corresponding true values\n",
    "# TODO"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The model's predictions are more or less in line with the truth. However, the utility of the model depends on the business context (i.e. you won't be making any money with this model). It's fair to question whether the variable `trend_3_day` even adds to the performance of the model:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "print('Root Mean Squared Error: {0:.2f}'.format(np.sqrt(mean_squared_error(y_test, X_test.day_prev_close))))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Indeed, the RMSE is actually lower if we simply use the previous day's closing value as a prediction! Does increasing the number of days included in the trend improve the model? Feel free to create new features and attempt to improve model performance!"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.5.3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
